IGNITE-28671 Describe healthy cluster behavior in general tips guide#13130
w3ll1ngt wants to merge 3 commits into
== What healthy cluster behavior looks like

A healthy Ignite cluster is not defined by a single latency, CPU, or memory number. In practice, it is a cluster whose topology is stable, whose cluster state and baseline match the intended deployment, whose partitions are not lost or divergent, whose rebalancing and checkpointing complete in bounded time, and whose execution queues and memory pools return to a steady level after short-lived spikes. Ignite exposes these signals through built-in metrics, system views, and the control script rather than through a single aggregate health score.
"A healthy Ignite cluster is not defined by a single latency, CPU, or memory number." - very strange comment
Yes, thanks. It looks weird at second glance, I agree. I intended to say something like: "a healthy cluster cannot be defined just by a few simple memory metrics (numbers), but rather by the whole complex system, etc."
When checking whether a cluster is healthy, start with topology and cluster state. The cluster should be in the expected state, usually ACTIVE, and the number of server and client nodes should be stable. If native persistence is enabled, the baseline should also be in the expected shape: for a stable deployment, the nodes that are expected to be online should appear online both in baseline-related metrics and in the SYS.BASELINE_NODES system view. Frequent unexpected topology changes are not normal and should be treated as a sign of node instability or network problems.
"usually ACTIVE" - this needs a cross-link to ACTIVE
Thank you, links added.
Checkpointing and transactions should also remain bounded. Checkpoint activity can slow the cluster down, so LastCheckpointDuration should be monitored together with dirty pages and disk behavior. Transactions and queries can legitimately take longer during bursts, but healthy steady-state behavior means that lock-holding transactions, long-running transactions, and long-running SQL queries do not accumulate over time. If long transactions repeatedly block partition map exchange, use transaction timeout settings such as TxTimeoutOnPartitionMapExchange and investigate the application path that keeps transactions open.
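One possible way to make "do not accumulate over time" concrete is to compare the floor (minimum) of a sampled count across consecutive windows: bursts raise the peak, while a real backlog raises the floor. The sketch below is a plain-Java illustration under that assumption. The class name `BacklogTrend`, the window size, and the sample values are hypothetical and are not part of Ignite; the samples could come from any periodically polled counter, such as a count of long-running transactions.

```java
import java.util.Arrays;

public class BacklogTrend {
    /** Smallest value in samples[from, to). */
    static long floor(long[] samples, int from, int to) {
        return Arrays.stream(samples, from, to).min().getAsLong();
    }

    /**
     * Returns true when the floor of every consecutive full window is strictly
     * higher than the floor of the previous one, i.e. the count never drains
     * back to its earlier level. Needs at least two full windows of samples.
     */
    static boolean isAccumulating(long[] samples, int window) {
        if (window <= 0)
            throw new IllegalArgumentException("window must be positive");

        int windows = samples.length / window;

        if (windows < 2)
            return false; // Not enough data to compare windows.

        long prev = floor(samples, 0, window);

        for (int w = 1; w < windows; w++) {
            long cur = floor(samples, w * window, (w + 1) * window);

            if (cur <= prev)
                return false; // The count drained back, so this is a burst.

            prev = cur;
        }

        return true;
    }

    public static void main(String[] args) {
        // Hypothetical samples: spikes that drain versus a steadily rising floor.
        long[] bursts = {0, 1, 9, 2, 0, 0, 7, 1, 0, 1, 8, 0};
        long[] backlog = {0, 1, 3, 2, 3, 4, 6, 5, 7, 9, 8, 10};

        System.out.println("bursts accumulating: " + isAccumulating(bursts, 4));
        System.out.println("backlog accumulating: " + isAccumulating(backlog, 4));
    }
}
```

Comparing floors rather than averages keeps a single large spike from being mistaken for accumulation; a real deployment would likely tune the window length to its own polling interval.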
Finally, check the underlying JVM and critical workers. Ignite treats IgniteOutOfMemoryException, OutOfMemoryError, system worker termination, system worker hangs, and cluster node segmentation as critical failures. A healthy cluster should not emit blocked system-critical worker messages, and JVM resource pools should stay comfortably below exhaustion. In practice, monitor heap usage, direct buffer usage, and open file descriptors continuously, because all three are finite pools and approaching their limits usually means the node is already close to a failure condition rather than merely under benign load.
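The three JVM pools named above can be read with the standard `java.lang.management` API, independently of Ignite. The following is a minimal sketch, assuming a HotSpot-style JDK where the platform buffer pool is named `direct` and where `com.sun.management.UnixOperatingSystemMXBean` is available on Unix-like systems. The class name `JvmPoolSnapshot` is hypothetical. Note that the `direct` pool tracks NIO direct buffers only, so it may not reflect all off-heap memory a node uses.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.lang.management.OperatingSystemMXBean;

public class JvmPoolSnapshot {
    /** Fraction of a pool in use, or -1 when the limit is unknown. */
    static double ratio(long used, long max) {
        return max > 0 ? (double) used / max : -1.0;
    }

    /** Used heap divided by max heap, or -1 if the max is undefined. */
    static double heapRatio() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();

        return ratio(heap.getUsed(), heap.getMax());
    }

    /** Bytes held by NIO direct buffers, or -1 if no "direct" pool is reported. */
    static long directBufferBytes() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName()))
                return pool.getMemoryUsed();
        }

        return -1;
    }

    /** Open file descriptors divided by the limit, or -1 on non-Unix platforms. */
    static double fileDescriptorRatio() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();

        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            com.sun.management.UnixOperatingSystemMXBean unix =
                (com.sun.management.UnixOperatingSystemMXBean) os;

            return ratio(unix.getOpenFileDescriptorCount(), unix.getMaxFileDescriptorCount());
        }

        return -1.0;
    }

    public static void main(String[] args) {
        System.out.println("heap ratio: " + heapRatio());
        System.out.println("direct buffer bytes: " + directBufferBytes());
        System.out.println("file descriptor ratio: " + fileDescriptorRatio());
    }
}
```

Reporting ratios rather than raw numbers makes it easier to alert on "approaching the limit"; the alert thresholds themselves are deployment-specific and are deliberately left out of the sketch.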
"as critical failures" - and what does it mean?
I completely reworked this phrase. Previously, I meant critical failures to be understood as something that triggers the FailureHandler. Thank you.



Thank you for submitting the pull request to the Apache Ignite.
In order to streamline the review of the contribution
we ask you to ensure the following steps have been taken:
The Contribution Checklist
The description explains WHAT and WHY was made instead of HOW.
The following pattern must be used:
IGNITE-XXXX Change summary, where XXXX is the number of the JIRA issue.
(see the Maintainers list)
the green visa attached to the JIRA ticket (see tab PR Check at TC.Bot - Instance 1 or TC.Bot - Instance 2)
Notes
If you need any help, please email dev@ignite.apache.org or ask for advice on http://asf.slack.com #ignite channel.